This quickstart guide explains how to match two tables using Magellan. Our goal is to come up with a workflow to match DBLP and ACM datasets. Specifically, we want to achieve precision greater than 95% and get recall as high as possible. The datasets contain information about the conference papers published in top databse conferences.

First, we need to import the Magellan package as follows:



In [1]:

    
# Import libraries
import py_entitymatching as em
import pandas as pd
import os, sys



In [2]:

    
print('python version: ' + sys.version )
print('pandas version: ' + pd.__version__ )
print('magellan version: ' + em.__version__ )









    



python version: 3.6.3 |Anaconda custom (64-bit)| (default, Oct 13 2017, 12:02:49) 
[GCC 7.2.0]
pandas version: 0.21.0
magellan version: 0.2.0

Matching two tables typically consists of the following three steps:

Loading the input tables
Blocking the input tables to get a candidate set
Matching the tuple pairs in the candidate set

1. Loading the input tables

We begin by loading the input tables. For the purpose of this guide, we use the datasets that are included with the package.



In [3]:

    
#dblp_dataset_path = os.sep.join(['DBLP_ACM', 'DBLP_cleaned.csv'])
#acm_dataset_path = os.sep.join(['DBLP_ACM', 'ACM_cleaned.csv'])

datasets_dir = em.get_install_path() + os.sep + 'datasets'

dblp_dataset_path = datasets_dir + os.sep + 'DBLP.csv'
acm_dataset_path = datasets_dir + os.sep + 'ACM.csv'



In [4]:

    
# Load csv files as dataframes and set the key attribute in the dataframe
A = em.read_csv_metadata(dblp_dataset_path, key='id')
B = em.read_csv_metadata(acm_dataset_path, key='id')









    



Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.



In [5]:

    
print('Number of tuples in A: ' + str(len(A)))
print('Number of tuples in B: ' + str(len(B)))
print('Number of tuples in A X B (i.e the cartesian product): ' + str(len(A)*len(B)))









    



Number of tuples in A: 2616
Number of tuples in B: 2294
Number of tuples in A X B (i.e the cartesian product): 6001104



In [6]:

    
A.head(2)









    Out[6]:







  
    
      
      id
      title
      authors
      venue
      year
    
  
  
    
      0
      journals/sigmod/Mackay99
      Semantic Integration of Environmental Models for Application to Global Information Systems and D...
      D. Scott Mackay
      SIGMOD Record
      1999.0
    
    
      1
      conf/vldb/PoosalaI96
      Estimation of Query-Result Distribution and its Application in Parallel-Join Load Balancing
      Viswanath Poosala, Yannis E. Ioannidis
      VLDB
      1996.0



In [7]:

    
B.head(2)









    Out[7]:







  
    
      
      id
      title
      authors
      venue
      year
    
  
  
    
      0
      304586
      The WASA2 object-oriented workflow management system
      Gottfried Vossen, Mathias Weske
      International Conference on Management of Data
      1999
    
    
      1
      304587
      A user-centered interface for querying distributed multimedia databases
      Isabel F. Cruz, Kimberly M. James
      International Conference on Management of Data
      1999



In [8]:

    
# Display the key attributes of table A and B.
em.get_key(A), em.get_key(B)









    Out[8]:





('id', 'id')



In [9]:

    
# If the tables are large we can downsample the tables like this
A1, B1 = em.down_sample(A, B, 500, 1)
# But for the demo, we will use the entire table A and B









    



0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:00

2. Blocking to create candidate tuple pairs

Before we do the matching, we would like to remove the obviously non-matching tuple pairs from the input tables. This would reduce the number of tuple pairs considered for matching.

Magellan provides four different blockers: (1) attribute equivalence, (2) overlap, (3) rule-based, and (4) black-box. Refer to [api reference] for more details. The user can mix and match these blockers to form a blocking sequence applied to input tables.

For the matching problem at hand, we know that two conference papers published in different years cannot match. So, we decide to apply an attribute equivelance blocker on the 'year' attribute.



In [10]:

    
# Plan
# A, B ------ attribute equivalence [year] -----> C1



In [11]:

    
# Create attribute equivalence blocker
ab = em.AttrEquivalenceBlocker()
# Block tables using 'year' attribute: same year then include in the canidate set
C1 = ab.block_tables(A, B, 'year', 'year', 
                   l_output_attrs=['title', 'authors', 'year'],
                   r_output_attrs=['title', 'authors', 'year']
                   )



In [12]:

    
# Check the number of rows in C1
len(C1)









    Out[12]:





577024



In [13]:

    
# Display first two rows from C1
C1.head(2)









    Out[13]:







  
    
      
      _id
      ltable_id
      rtable_id
      ltable_title
      ltable_authors
      ltable_year
      rtable_title
      rtable_authors
      rtable_year
    
  
  
    
      0
      0
      journals/sigmod/Mackay99
      304586
      Semantic Integration of Environmental Models for Application to Global Information Systems and D...
      D. Scott Mackay
      1999
      The WASA2 object-oriented workflow management system
      Gottfried Vossen, Mathias Weske
      1999
    
    
      1
      1
      journals/sigmod/Mackay99
      304587
      Semantic Integration of Environmental Models for Application to Global Information Systems and D...
      D. Scott Mackay
      1999
      A user-centered interface for querying distributed multimedia databases
      Isabel F. Cruz, Kimberly M. James
      1999

The number of tuple pairs considered for matching is reduced to 601284 (from 6001104), but we would want to make sure that the blocker did not drop any potential matches. We could debug the blocker output in Magellan as follows:



In [14]:

    
# Debug blocker output
dbg = em.debug_blocker(C1, A, B, output_size=200)



In [15]:

    
# Display first few tuple pairs from the debug_blocker's output
dbg.head()









    Out[15]:







  
    
      
      _id
      similarity
      ltable_id
      rtable_id
      ltable_title
      ltable_authors
      ltable_venue
      rtable_title
      rtable_authors
      rtable_venue
    
  
  
    
      0
      0
      0.950000
      journals/sigmod/KoubarakisTID03
      945736
      Selective information dissemination in P2P networks: problems and solutions
      Stratos Idreos, Manolis Koubarakis, Christos Tryfonopoulos, Yannis Drougas
      SIGMOD Record
      Selective information dissemination in P2P networks: problems and solutions
      Manolis Koubarakis, Christos Tryfonopoulos, Stratos Idreos, Yannis Drougas
      ACM SIGMOD Record
    
    
      1
      1
      0.941176
      journals/sigmod/LabrinidisR00
      344794
      Generating Dynamic Content at Database-Backed Web Servers: cgi-bin vs. mod_perl
      Alexandros Labrinidis, Nick Roussopoulos
      SIGMOD Record
      Generating dynamic content at database-backed web servers: cgi-bin vs. mod_perl
      Alexandros Labrinidis, Nick Roussopoulos
      ACM SIGMOD Record
    
    
      2
      2
      0.923077
      journals/sigmod/Chen94
      187460
      Database Research at NTHU and ITRI
      Arbee L. P. Chen
      SIGMOD Record
      Database research at NTHU and ITRI
      Arbee L. P. Chen
      ACM SIGMOD Record
    
    
      3
      3
      0.909091
      journals/sigmod/MedeirosP94
      181566
      Databases for GIS
      Claudia Bauzer Medeiros, Fatima Pires
      SIGMOD Record
      Databases for GIS
      Claudia Bauzer Medeiros, Fatima Pires
      ACM SIGMOD Record
    
    
      4
      4
      0.900000
      journals/sigmod/Kosch02
      565123
      MPEG-7 and Multimedia Database Systems
      Harald Kosch
      SIGMOD Record
      MPEG-7 and multimedia database systems
      Harald Kosch
      ACM SIGMOD Record

From the debug blocker's output we observe that the current blocker drops quite a few potential matches. We would want to update the blocking sequence to avoid dropping these potential matches.

For the considered dataset, we know that for the conference papers to match the author names must overlap between them. We could use overlap blocker for this purpose. Finally, we would want to union the outputs from the attribute equivalence blocker and the overlap blocker to get a consolidated candidate set.



In [16]:

    
# Updated blocking sequence
# A, B ------ attribute equivalence [year] -----> C1--
#                                                     |----> C
# A, B ------ overlap blocker [authors] --------> C2--



In [17]:

    
# Create an overlap blocker
ob = em.OverlapBlocker()
# Apply overlap blocker on 'authors' attribute
C2 = ob.block_tables(A, B, 'authors', 'authors', 
                   l_output_attrs=['title', 'authors', 'year'],
                   r_output_attrs=['title', 'authors', 'year']
                   )









    



0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:00



In [18]:

    
# Check the number of rows in C2
len(C2)









    Out[18]:





316015



In [19]:

    
# Display first two rows from C2
C2.head(2)









    Out[19]:







  
    
      
      _id
      ltable_id
      rtable_id
      ltable_title
      ltable_authors
      ltable_year
      rtable_title
      rtable_authors
      rtable_year
    
  
  
    
      0
      0
      journals/tods/LechtenborgerV03
      304586
      On the computation of relational view complements
      Gottfried Vossen, Jens Lechtenbrger
      2003.0
      The WASA2 object-oriented workflow management system
      Gottfried Vossen, Mathias Weske
      1999
    
    
      1
      1
      journals/sigmod/CherniakV03
      304586
      Reminiscences on Influential Papers
      Mitch Cherniack, Kenneth A. Ross, Gottfried Vossen
      2003.0
      The WASA2 object-oriented workflow management system
      Gottfried Vossen, Mathias Weske
      1999



In [20]:

    
# Combine blocker outputs
C = em.combine_blocker_outputs_via_union([C1, C2])



In [21]:

    
# Check the number of rows in the consolidated candidate set.
len(C)









    Out[21]:





860645



In [22]:

    
# Debug again
dbg = em.debug_blocker(C, A, B)



In [23]:

    
# Display first few rows from the debugger output
dbg.head(3)









    Out[23]:







  
    
      
      _id
      similarity
      ltable_id
      rtable_id
      ltable_title
      ltable_authors
      ltable_venue
      rtable_title
      rtable_authors
      rtable_venue
    
  
  
    
      0
      0
      0.736842
      journals/sigmod/ACT-NET96
      234896
      ACT-NET - The Active Database Management System Manifesto: A Rulebase of ADBMS Features
      ?
      SIGMOD Record
      The active database management system manifesto: a rulebase of ADBMS features
      Corporate Act-Net Consortium
      ACM SIGMOD Record
    
    
      1
      1
      0.571429
      conf/vldb/Team00
      671838
      High-Performance and Scalability through Application Tier,In-Memory Data Management
      ?
      VLDB
      High-Performance and Scalability through Application Tier,In-Memory Data Management
      NaN
      Very Large Data Bases
    
    
      2
      2
      0.500000
      journals/sigmod/Dogac98
      945727
      Guest Editor's Introduction
      Asuman Dogac
      SIGMOD Record
      Guest editor's introduction
      Karl Aberer
      ACM SIGMOD Record

We observe that the current blocker sequence does not drop obvious potential matches, and we can proceed with the matching step now. A subtle point to note here is, debugging blocker output practically provides a stopping criteria for modifying the blocker sequence.

3. Matching tuple pairs in the candidate set

In this step, we would want to match the tuple pairs in the candidate set. Specifically, we use learning-based method for matching purposes.

This typically involves the following five steps:

Sampling and labeling the candidate set
Splitting the labeled data into development and evaluation set
Selecting the best learning based matcher using the development set
Evaluating the selected matcher using the evaluation set

3.1 Sampling and labeling the candidate set

First, we randomly sample 450 tuple pairs for labeling purposes.



In [24]:

    
# Sample candidate set
S = em.sample_table(C, 450)

Next, we label the sampled candidate set. Specify we would enter 1 for a match and 0 for a non-match.



In [25]:

    
# Label S and specify the attribute name for the label column
L = em.label_table(S, 'gold')









    



Column name (gold) is not present in dataframe
/u/p/m/pmartinkus/Documents/Research/py_entitymatching/py_entitymatching/gui/table_gui.py:94: FutureWarning: set_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead
  table.set_value(idxv[i], cols[j], val)

For the purposes of this guide, we will load in a pre-labeled dataset (of 415 tuple pairs) included in this package.



In [26]:

    
# Load the pre-labeled data
labeled_dataset_path = datasets_dir + os.sep + 'dblp_acm_demo_labels.csv'
L = em.read_csv_metadata(labeled_dataset_path, 
                         key='_id',
                         ltable=A, rtable=B, 
                         fk_ltable='ltable.id', fk_rtable='rtable.id')

# Display the number of rows in the labaled data set
len(L)









    



Metadata file is not present in the given path; proceeding to read the csv file.






    Out[26]:





415

3.2 Splitting the labeled data into development and evaluation set

In this step, we split the labeled data into two sets: development and evaluation. Specifically, the development set is used to come up with the best learning-based matcher and the evaluation set used to evaluate the selected matcher on unseen data.



In [27]:

    
# Split the labeled data into development and evaluation set
development_evaluation = em.split_train_test(L, train_proportion=0.7)
development =  development_evaluation['train']
evaluation = development_evaluation['test']

3.3 Select the best learning-based matcher

Selecting the best learning-based matcher typically involves the following steps:

Creating a set of learning-based matchers
Creating features
Extracting feature vectors
Selecting the best learning-based matcher using k-fold cross validation
Debugging the matcher (and possibly repeat the above steps)

3.3.1 Creating a set of learning-based matchers

First, we need to create a set of learning-based matchers. The following matchers are supported in Magellan: (1) decision tree, (2) random forest, (3) naive bayes, (4) svm, (5) logistic regression, and (6) linear regression.



In [28]:

    
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree')
svm = em.SVMMatcher(name='SVM')
rf = em.RFMatcher(name='RF')
nb = em.NBMatcher(name='NB')
lg = em.LogRegMatcher(name='LogReg')
ln = em.LinRegMatcher(name='LinReg')

3.3.2 Creating features

Next, we need to create a set of features for the development set. Magellan provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features.



In [29]:

    
# Generate features
feature_table = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)



In [30]:

    
# List the names of the features generated
feature_table['feature_name']









    Out[30]:





0             title_title_jac_qgm_3_qgm_3
1         title_title_cos_dlm_dc0_dlm_dc0
2                         title_title_mel
3                    title_title_lev_dist
4                     title_title_lev_sim
5         authors_authors_jac_qgm_3_qgm_3
6     authors_authors_cos_dlm_dc0_dlm_dc0
7                     authors_authors_mel
8                authors_authors_lev_dist
9                 authors_authors_lev_sim
10                          year_year_exm
11                          year_year_anm
12                     year_year_lev_dist
13                      year_year_lev_sim
Name: feature_name, dtype: object



In [31]:

    
# Select the year related features
feature_subset_iter1 = feature_table[10:14]



In [32]:

    
# List the names of the features selected
feature_subset_iter1['feature_name']









    Out[32]:





10         year_year_exm
11         year_year_anm
12    year_year_lev_dist
13     year_year_lev_sim
Name: feature_name, dtype: object

3.3.3 Extracting feature vectors

In this step, we extract feature vectors using the development set and the created features.



In [33]:

    
# Extract feature vectors
feature_vectors_dev = em.extract_feature_vecs(development, 
                            feature_table=feature_subset_iter1, 
                            attrs_after='gold')









    



0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:00



In [34]:

    
# Display first few rows
feature_vectors_dev.head(3)









    Out[34]:







  
    
      
      _id
      ltable.id
      rtable.id
      year_year_exm
      year_year_anm
      year_year_lev_dist
      year_year_lev_sim
      gold
    
  
  
    
      31
      31
      conf/sigmod/Chamberlin03
      872877
      1.0
      1.0
      2.0
      0.666667
      1
    
    
      180
      180
      conf/vldb/GoldmanSVG98
      671346
      1.0
      1.0
      2.0
      0.666667
      1
    
    
      163
      163
      conf/vldb/ChaudhuriW00
      671696
      1.0
      1.0
      2.0
      0.666667
      1

Next, we might have to impute the feature vectors as it might contain missing values. First, let us check if there are any missing values in the extracted feature vectors.



In [35]:

    
# Check if the feature vectors contain missing values
# A return value of True means that there are missing values
any(pd.isnull(feature_vectors_dev))









    Out[35]:





True

We observe that the extracted feature vectors contain missing values. We have to impute the missing values for the learning-based matchers to fit the model correctly. For the purposes of this guide, we impute the missing value in a column with the mean of the values in that column.



In [36]:

    
# Impute feature vectors with the mean of the column values.
feature_vectors_dev = em.impute_table(feature_vectors_dev, 
                exclude_attrs=['_id', 'ltable.id', 'rtable.id', 'gold'],
                strategy='mean')

3.3.4 Selecting the best matcher using cross-validation

Now, we select the best matcher using k-fold cross-validation. For the purposes of this guide, we use five fold cross validation and use 'precision' metric to select the best matcher.



In [37]:

    
# Select the best ML matcher using CV
result = em.select_matcher([dt, rf, svm, nb, lg, ln], table=feature_vectors_dev, 
        exclude_attrs=['_id', 'ltable.id', 'rtable.id', 'gold'],
        k=10,
        target_attr='gold', 
        metric_to_select_matcher='precision',
        random_state=0)



In [38]:

    
result['cv_stats']









    Out[38]:







  
    
      
      Matcher
      Average precision
      Average recall
      Average f1
    
  
  
    
      0
      DecisionTree
      0.775601
      1.000000
      0.870943
    
    
      1
      RF
      0.775601
      1.000000
      0.870943
    
    
      2
      SVM
      0.775601
      1.000000
      0.870943
    
    
      3
      NB
      0.781584
      0.970992
      0.863488
    
    
      4
      LogReg
      0.781584
      0.970992
      0.863488
    
    
      5
      LinReg
      0.775601
      1.000000
      0.870943



In [39]:

    
result['drill_down_cv_stats']['precision']









    Out[39]:







  
    
      
      Name
      Matcher
      Num folds
      Fold 1
      Fold 2
      Fold 3
      Fold 4
      Fold 5
      Fold 6
      Fold 7
      Fold 8
      Fold 9
      Fold 10
      Mean score
    
  
  
    
      0
      DecisionTree
      <py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x7f95caa34710>
      10
      0.857143
      0.807692
      0.782609
      0.777778
      0.636364
      0.6875
      0.923077
      0.714286
      0.869565
      0.700000
      0.775601
    
    
      1
      RF
      <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x7f95caa347f0>
      10
      0.857143
      0.807692
      0.782609
      0.777778
      0.636364
      0.6875
      0.923077
      0.714286
      0.869565
      0.700000
      0.775601
    
    
      2
      SVM
      <py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x7f95caa34748>
      10
      0.857143
      0.807692
      0.782609
      0.777778
      0.636364
      0.6875
      0.923077
      0.714286
      0.869565
      0.700000
      0.775601
    
    
      3
      NB
      <py_entitymatching.matcher.nbmatcher.NBMatcher object at 0x7f95caa34860>
      10
      0.900000
      0.807692
      0.772727
      0.777778
      0.650000
      0.6875
      0.920000
      0.714286
      0.863636
      0.722222
      0.781584
    
    
      4
      LogReg
      <py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x7f95caa34908>
      10
      0.900000
      0.807692
      0.772727
      0.777778
      0.650000
      0.6875
      0.920000
      0.714286
      0.863636
      0.722222
      0.781584
    
    
      5
      LinReg
      <py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x7f95caa34978>
      10
      0.857143
      0.807692
      0.782609
      0.777778
      0.636364
      0.6875
      0.923077
      0.714286
      0.869565
      0.700000
      0.775601

3.3.5 Debugging matcher

We observe that the best matcher is not getting us to the precision that we expect (i.e > 95%). We debug the matcher to see what might be wrong.

To do this, first we split the feature vectors into train and test.



In [40]:

    
# # Split feature vectors into train and test
train_test = em.split_train_test(feature_vectors_dev, train_proportion=0.5)
train = train_test['train']
test = train_test['test']



In [41]:

    
# Debug decision tree using GUI
em.vis_debug_rf(rf, train, test, 
        exclude_attrs=['_id', 'ltable.id', 'rtable.id', 'gold'],
        target_attr='gold')

From the GUI, we observe that using only 'year' related features result in a lot of false positives. So we decide to use all the features in the feature table (which had author, title and year related features).



In [42]:

    
# Select all features from the feature table
feature_subset_iter2 = feature_table

Now, we repeat extracting feature vectors (this time with updated feature table), imputing table and selecting the best matcher again using cross-validation.



In [43]:

    
# Get new set of features
feature_vectors_dev = em.extract_feature_vecs(development, feature_table=feature_subset_iter2, attrs_after='gold')









    



0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:00



In [44]:

    
# Check if imputation is required
any(pd.isnull(feature_vectors_dev))









    Out[44]:





True



In [45]:

    
# Impute feature vectors
feature_vectors_dev = em.impute_table(feature_vectors_dev, 
                exclude_attrs=['_id', 'ltable.id', 'rtable.id', 'gold'],
                strategy='mean')



In [46]:

    
# Apply cross validation to find if there is a better matcher
result = em.select_matcher([dt, rf, svm, nb, lg, ln], table=feature_vectors_dev, 
        exclude_attrs=['_id', 'ltable.id', 'rtable.id', 'gold'],
        k=10,
        target_attr='gold', 
        metric_to_select_matcher='precision',
        random_state=0)



In [47]:

    
result['cv_stats']









    Out[47]:







  
    
      
      Matcher
      Average precision
      Average recall
      Average f1
    
  
  
    
      0
      DecisionTree
      0.993333
      0.981746
      0.987134
    
    
      1
      RF
      1.000000
      1.000000
      1.000000
    
    
      2
      SVM
      0.964150
      0.901706
      0.929124
    
    
      3
      NB
      1.000000
      0.970992
      0.985044
    
    
      4
      LogReg
      0.979737
      0.976190
      0.976707
    
    
      5
      LinReg
      0.981344
      0.987302
      0.984071



In [48]:

    
result['drill_down_cv_stats']['precision']









    Out[48]:







  
    
      
      Name
      Matcher
      Num folds
      Fold 1
      Fold 2
      Fold 3
      Fold 4
      Fold 5
      Fold 6
      Fold 7
      Fold 8
      Fold 9
      Fold 10
      Mean score
    
  
  
    
      0
      DecisionTree
      <py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x7f95caa34710>
      10
      1.0
      1.0
      1.000000
      0.933333
      1.000000
      1.000000
      1.0
      1.0000
      1.0
      1.000000
      0.993333
    
    
      1
      RF
      <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x7f95caa347f0>
      10
      1.0
      1.0
      1.000000
      1.000000
      1.000000
      1.000000
      1.0
      1.0000
      1.0
      1.000000
      1.000000
    
    
      2
      SVM
      <py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x7f95caa34748>
      10
      1.0
      1.0
      0.941176
      1.000000
      1.000000
      0.916667
      1.0
      0.9375
      1.0
      0.846154
      0.964150
    
    
      3
      NB
      <py_entitymatching.matcher.nbmatcher.NBMatcher object at 0x7f95caa34860>
      10
      1.0
      1.0
      1.000000
      1.000000
      1.000000
      1.000000
      1.0
      1.0000
      1.0
      1.000000
      1.000000
    
    
      4
      LogReg
      <py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x7f95caa34908>
      10
      1.0
      1.0
      0.947368
      1.000000
      0.933333
      0.916667
      1.0
      1.0000
      1.0
      1.000000
      0.979737
    
    
      5
      LinReg
      <py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x7f95caa34978>
      10
      1.0
      1.0
      0.947368
      1.000000
      1.000000
      1.000000
      1.0
      0.9375
      1.0
      0.928571
      0.981344

Now, observe the best matcher is achieving the expected precision and we can proceed on to evaluating the best matcher on the unseen data (the evaluation set).

3.4 Evaluating the matching output

Evaluating the matching outputs for the evaluation set typically involves the following four steps:

Extracting the feature vectors
Training matcher using the feature vectors extracted from the development set
Predicting the evaluation set using the trained matcher
Evaluating the predicted matches

3.4.1 Extracting the feature vectors

As before, we extract the feature vectors (using the updated feature table and the evaluation set) and impute it (if necessary).



In [49]:

    
# Get new set of features
feature_vectors_eval = em.extract_feature_vecs(evaluation, 
                                               feature_table=feature_subset_iter2, 
                                               attrs_after='gold')









    



0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:00



In [50]:

    
# Check if the feature vectors contain missing values
# A return value of True means that there are missing values
any(pd.isnull(feature_vectors_eval))









    Out[50]:





True



In [51]:

    
# Impute feature vectors
feature_vectors_eval = em.impute_table(feature_vectors_eval, 
                exclude_attrs=['_id', 'ltable.id', 'rtable.id', 'gold'],
                strategy='mean')

3.4.2 Training the matcher

Now, we train the matcher using all of the feature vectors from the development set. For the purposes of this guide we use random forest as the selected matcher.



In [52]:

    
# Train using feature vectors from the development set
rf.fit(table=feature_vectors_dev, 
       exclude_attrs=['_id', 'ltable.id', 'rtable.id', 'gold'], 
       target_attr='gold')

3.4.3 Predicting the matches

Next, we predict the matches for the evaluation set (using the feature vectors extracted from it).



In [53]:

    
# Predict M 
predictions = rf.predict(table=feature_vectors_eval, 
                         exclude_attrs=['_id', 'ltable.id', 'rtable.id', 'gold'], 
                         append=True, 
                         target_attr='predicted', 
                         inplace=False)

3.4.4 Evaluating the matching output

Finally, we evaluate the predicted outputs



In [54]:

    
# Evaluate the result
eval_result = em.eval_matches(predictions, 'gold', 'predicted')
em.print_eval_summary(eval_result)









    



Precision : 96.88% (62/64)
Recall : 100.0% (62/62)
F1 : 98.41%
False positives : 2 (out of 64 positive predictions)
False negatives : 0 (out of 61 negative predictions)

	id	title	authors	venue	year
0	journals/sigmod/Mackay99	Semantic Integration of Environmental Models for Application to Global Information Systems and D...	D. Scott Mackay	SIGMOD Record	1999.0
1	conf/vldb/PoosalaI96	Estimation of Query-Result Distribution and its Application in Parallel-Join Load Balancing	Viswanath Poosala, Yannis E. Ioannidis	VLDB	1996.0

	id	title	authors	venue	year
0	304586	The WASA2 object-oriented workflow management system	Gottfried Vossen, Mathias Weske	International Conference on Management of Data	1999
1	304587	A user-centered interface for querying distributed multimedia databases	Isabel F. Cruz, Kimberly M. James	International Conference on Management of Data	1999

	_id	similarity	ltable_id	rtable_id	ltable_title	ltable_authors	ltable_venue	rtable_title	rtable_authors	rtable_venue
0	0	0.950000	journals/sigmod/KoubarakisTID03	945736	Selective information dissemination in P2P networks: problems and solutions	Stratos Idreos, Manolis Koubarakis, Christos Tryfonopoulos, Yannis Drougas	SIGMOD Record	Selective information dissemination in P2P networks: problems and solutions	Manolis Koubarakis, Christos Tryfonopoulos, Stratos Idreos, Yannis Drougas	ACM SIGMOD Record
1	1	0.941176	journals/sigmod/LabrinidisR00	344794	Generating Dynamic Content at Database-Backed Web Servers: cgi-bin vs. mod_perl	Alexandros Labrinidis, Nick Roussopoulos	SIGMOD Record	Generating dynamic content at database-backed web servers: cgi-bin vs. mod_perl	Alexandros Labrinidis, Nick Roussopoulos	ACM SIGMOD Record
2	2	0.923077	journals/sigmod/Chen94	187460	Database Research at NTHU and ITRI	Arbee L. P. Chen	SIGMOD Record	Database research at NTHU and ITRI	Arbee L. P. Chen	ACM SIGMOD Record
3	3	0.909091	journals/sigmod/MedeirosP94	181566	Databases for GIS	Claudia Bauzer Medeiros, Fatima Pires	SIGMOD Record	Databases for GIS	Claudia Bauzer Medeiros, Fatima Pires	ACM SIGMOD Record
4	4	0.900000	journals/sigmod/Kosch02	565123	MPEG-7 and Multimedia Database Systems	Harald Kosch	SIGMOD Record	MPEG-7 and multimedia database systems	Harald Kosch	ACM SIGMOD Record

	_id	ltable_id	rtable_id	ltable_title	ltable_authors	ltable_year	rtable_title	rtable_authors	rtable_year
0	0	journals/tods/LechtenborgerV03	304586	On the computation of relational view complements	Gottfried Vossen, Jens Lechtenbrger	2003.0	The WASA2 object-oriented workflow management system	Gottfried Vossen, Mathias Weske	1999
1	1	journals/sigmod/CherniakV03	304586	Reminiscences on Influential Papers	Mitch Cherniack, Kenneth A. Ross, Gottfried Vossen	2003.0	The WASA2 object-oriented workflow management system	Gottfried Vossen, Mathias Weske	1999

	_id	similarity	ltable_id	rtable_id	ltable_title	ltable_authors	ltable_venue	rtable_title	rtable_authors	rtable_venue
0	0	0.736842	journals/sigmod/ACT-NET96	234896	ACT-NET - The Active Database Management System Manifesto: A Rulebase of ADBMS Features	?	SIGMOD Record	The active database management system manifesto: a rulebase of ADBMS features	Corporate Act-Net Consortium	ACM SIGMOD Record
1	1	0.571429	conf/vldb/Team00	671838	High-Performance and Scalability through Application Tier,In-Memory Data Management	?	VLDB	High-Performance and Scalability through Application Tier,In-Memory Data Management	NaN	Very Large Data Bases
2	2	0.500000	journals/sigmod/Dogac98	945727	Guest Editor's Introduction	Asuman Dogac	SIGMOD Record	Guest editor's introduction	Karl Aberer	ACM SIGMOD Record

	_id	ltable.id	rtable.id	year_year_exm	year_year_anm	year_year_lev_dist	year_year_lev_sim	gold
31	31	conf/sigmod/Chamberlin03	872877	1.0	1.0	2.0	0.666667	1
180	180	conf/vldb/GoldmanSVG98	671346	1.0	1.0	2.0	0.666667	1
163	163	conf/vldb/ChaudhuriW00	671696	1.0	1.0	2.0	0.666667	1

	Matcher	Average precision	Average recall	Average f1
0	DecisionTree	0.775601	1.000000	0.870943
1	RF	0.775601	1.000000	0.870943
2	SVM	0.775601	1.000000	0.870943
3	NB	0.781584	0.970992	0.863488
4	LogReg	0.781584	0.970992	0.863488
5	LinReg	0.775601	1.000000	0.870943

	Name	Matcher	Num folds	Fold 1	Fold 2	Fold 3	Fold 4	Fold 5	Fold 6	Fold 7	Fold 8	Fold 9	Fold 10	Mean score
0	DecisionTree	<py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x7f95caa34710>	10	0.857143	0.807692	0.782609	0.777778	0.636364	0.6875	0.923077	0.714286	0.869565	0.700000	0.775601
1	RF	<py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x7f95caa347f0>	10	0.857143	0.807692	0.782609	0.777778	0.636364	0.6875	0.923077	0.714286	0.869565	0.700000	0.775601
2	SVM	<py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x7f95caa34748>	10	0.857143	0.807692	0.782609	0.777778	0.636364	0.6875	0.923077	0.714286	0.869565	0.700000	0.775601
3	NB	<py_entitymatching.matcher.nbmatcher.NBMatcher object at 0x7f95caa34860>	10	0.900000	0.807692	0.772727	0.777778	0.650000	0.6875	0.920000	0.714286	0.863636	0.722222	0.781584
4	LogReg	<py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x7f95caa34908>	10	0.900000	0.807692	0.772727	0.777778	0.650000	0.6875	0.920000	0.714286	0.863636	0.722222	0.781584
5	LinReg	<py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x7f95caa34978>	10	0.857143	0.807692	0.782609	0.777778	0.636364	0.6875	0.923077	0.714286	0.869565	0.700000	0.775601

	Matcher	Average precision	Average recall	Average f1
0	DecisionTree	0.993333	0.981746	0.987134
1	RF	1.000000	1.000000	1.000000
2	SVM	0.964150	0.901706	0.929124
3	NB	1.000000	0.970992	0.985044
4	LogReg	0.979737	0.976190	0.976707
5	LinReg	0.981344	0.987302	0.984071